Setup

Part 1: Data

We have acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset includes information from Rotten Tomatoes and IMDb for a random sample of movies.

We’re interested in learning what attributes make a movie popular.

The data set comprises 651 randomly sampled movies produced and released before 2016, each scored on both Rotten Tomatoes and IMDb.

This is a retrospective observational study with random sampling but no random assignment, which means we cannot draw causal conclusions from this data. We can, however, make statements about association, and our findings can be generalized to the population of movies released before 2016 and scored on both Rotten Tomatoes and IMDb.


Part 2: Research question

Without watching a movie, can we say whether the audience will love it?

Obviously, the popularity of a movie depends mainly on its content and artistic value. Those two factors are very hard, if not impossible, to quantify. However, certain characteristics of a movie can be measured and classified comparatively easily: among others, genre, duration, and cast quality. Using these and other variables from the data frame, we’ll try to build a model that predicts a movie’s popularity with the audience.


Part 3: Exploratory data analysis

We start our analysis with a high-level overview of the data frame provided. The table below shows the number of movies of the three major types (Documentary, Feature Film, and TV Movie) in the data set, the time range of theatrical releases (thtr_rel_year), and the median audience rating on both IMDb (imdb_rating) and Rotten Tomatoes (audience_score). We choose the median rather than the mean for the audience ratings because we suspect the rating distributions are skewed, and the median is a more robust statistic for skewed distributions.

title_type n() min(thtr_rel_year) max(thtr_rel_year) median(imdb_rating) median(audience_score)
Documentary 55 1970 2012 7.7 86
Feature Film 591 1972 2014 6.5 62
TV Movie 5 1993 2012 7.3 75

Table 1. Movie statistics grouped by title type

We can see that the majority of movies in the data set belong to the ‘Feature Film’ group; 55 movies are Documentaries and only 5 are TV Movies.

It’s reasonable to expect that the popularity of movies in these three groups depends on very different factors. This is partially confirmed by the significant difference between the median ratings of Feature Films and the two other groups. We therefore decided to work only with the most populous group in this data set, Feature Films. We create a subset (ff, for Feature Film) of the data frame and use it from now on:
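A minimal dplyr sketch of this step (assuming the original data frame is loaded as `movies`; the object names are illustrative):

```r
library(dplyr)

# Keep only Feature Films and drop the now-unused factor levels
# (Documentary, TV Movie) so they don't appear in later summaries
ff <- movies %>%
  filter(title_type == "Feature Film") %>%
  droplevels()
```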

Let’s now take a look at the data summary (we exclude some variables from the summary to keep the report shorter):
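The summary below could be produced along these lines (the excluded columns are the informational ones discussed in Part 4; the exact selection is an assumption):

```r
library(dplyr)

# Drop purely informational columns before summarizing
ff %>%
  select(-title, -director, -imdb_url, -rt_url,
         -actor1, -actor2, -actor3, -actor4, -actor5) %>%
  summary()
```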

##                 genre        runtime       mpaa_rating 
##  Drama             :301   Min.   : 68.0   G      : 16  
##  Comedy            : 85   1st Qu.: 93.0   NC-17  :  2  
##  Action & Adventure: 65   Median :104.0   PG     :110  
##  Mystery & Suspense: 59   Mean   :106.6   PG-13  :130  
##  Horror            : 23   3rd Qu.:116.0   R      :317  
##  Other             : 15   Max.   :202.0   Unrated: 16  
##  (Other)           : 43                                
##                               studio    thtr_rel_year  thtr_rel_month  
##  Paramount Pictures              : 37   Min.   :1972   Min.   : 1.000  
##  Warner Bros. Pictures           : 29   1st Qu.:1990   1st Qu.: 4.000  
##  Sony Pictures Home Entertainment: 26   Median :1999   Median : 7.000  
##  Universal Pictures              : 23   Mean   :1997   Mean   : 6.783  
##  Warner Home Video               : 19   3rd Qu.:2006   3rd Qu.:10.000  
##  (Other)                         :451   Max.   :2014   Max.   :12.000  
##  NA's                            :  6                                  
##   thtr_rel_day    dvd_rel_year  dvd_rel_month     dvd_rel_day   
##  Min.   : 1.00   Min.   :1991   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000   1st Qu.: 7.00  
##  Median :15.00   Median :2003   Median : 6.000   Median :15.00  
##  Mean   :14.48   Mean   :2004   Mean   : 6.304   Mean   :15.06  
##  3rd Qu.:21.50   3rd Qu.:2007   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :31.00   Max.   :2015   Max.   :12.000   Max.   :31.00  
##                  NA's   :6      NA's   :6        NA's   :6      
##   imdb_rating    imdb_num_votes           critics_rating critics_score   
##  Min.   :1.900   Min.   :   390   Certified Fresh:116    Min.   :  1.00  
##  1st Qu.:5.850   1st Qu.:  6276   Fresh          :172    1st Qu.: 31.00  
##  Median :6.500   Median : 17934   Rotten         :303    Median : 57.00  
##  Mean   :6.387   Mean   : 62861                          Mean   : 54.78  
##  3rd Qu.:7.100   3rd Qu.: 66112                          3rd Qu.: 79.00  
##  Max.   :9.000   Max.   :893008                          Max.   :100.00  
##                                                                          
##  audience_rating audience_score  best_pic_nom best_pic_win best_actor_win
##  Spilled:273     Min.   :11.00   no :569      no :584      no :500       
##  Upright:318     1st Qu.:44.50   yes: 22      yes:  7      yes: 91       
##                  Median :62.00                                           
##                  Mean   :60.47                                           
##                  3rd Qu.:78.00                                           
##                  Max.   :97.00                                           
##                                                                          
##  best_actress_win best_dir_win top200_box
##  no :521          no :548      no :576   
##  yes: 70          yes: 43      yes: 15   
##                                          
##                                          
##                                          
##                                          
## 

The variables of most interest relate to movie ratings: the audience rating on IMDb (imdb_rating), the critics score on Rotten Tomatoes (critics_score), and the audience score on Rotten Tomatoes (audience_score). For all three variables the median is higher than the mean, which supports our earlier assumption that the rating distributions are skewed; in fact, all three distributions are left skewed. This can also be seen in the following histograms:

Interestingly, the IMDb rating shows a unimodal, left-skewed distribution centered around 6.4 (Plot 1), while both Rotten Tomatoes scores show an almost uniform distribution with a slight left skew and no apparent center (Plots 2 and 3). This discrepancy could be explained by the larger number of voters on IMDb, which would make the center of the distribution more pronounced. This remains speculative, though, as we have no data on the number of votes behind the Rotten Tomatoes scores.

We can also conclude that IMDb users are less likely to give a movie a low rating than Rotten Tomatoes users and critics. Indeed, 75% of movies received a rating of 5.85 or higher (on a 1-to-10 scale, equivalent to 58.5 out of 100) on IMDb, 44.5 or higher from the audience on RT, and only 31 or higher from the critics on RT.

However, the most plausible cause of the differences in the distributions lies in how the rating and the scores are calculated on the two platforms:

Basically, the Rotten Tomatoes score counts only the share of positive ratings, while the IMDb rating averages all ratings.

For example, an audience score of 25% on Rotten Tomatoes could correspond to an IMDb rating anywhere between 2.5 and 7.5, depending on the individual ratings behind that 25% score on RT. In other words, in the plots above we are not comparing apples to apples (or, rather, tomatoes to tomatoes).

Let’s take a look at the median statistics for feature films grouped by genre:

genre total IMDb RT_critics RT_audience
Action & Adventure 65 6.00 33 52.0
Animation 9 6.40 48 65.0
Art House & International 14 6.50 52 65.5
Comedy 85 5.70 36 49.0
Documentary 3 6.90 40 74.0
Drama 301 6.80 67 70.0
Horror 23 5.90 40 43.0
Musical & Performing Arts 8 7.25 80 82.5
Mystery & Suspense 59 6.50 60 54.0
Other 15 7.00 72 74.0
Science Fiction & Fantasy 9 5.90 67 47.0

Table 2. Feature Films rating and scores medians by genre
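A dplyr sketch that could produce Table 2 (column names chosen to match the table above):

```r
library(dplyr)

# Median rating/scores per genre for the Feature Film subset
ff %>%
  group_by(genre) %>%
  summarise(total       = n(),
            IMDb        = median(imdb_rating),
            RT_critics  = median(critics_score),
            RT_audience = median(audience_score))
```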

We can see that Drama is the most populous category, with more than half of the movies in the data set, a median IMDb rating of 6.80, an RT critics score of 67, and an RT audience score of 70. The highest median IMDb rating, 7.25, belongs to the Musical & Performing Arts category, which also has the highest median RT scores from both critics and audience. Note, however, that only 8 movies fall into this group, so these statistics could look quite different for a larger sample. The Comedy group has the lowest median IMDb rating of all, 5.70, the second-lowest median RT critics score, and the third-lowest median RT audience score.

It’s also interesting to look at the distribution of movie runtimes:

Here we see a right-skewed unimodal distribution centered around 100 minutes, with several outliers.


Part 4: Modeling

The data set contains several variables on movies: title, runtime, date of release, production company, cast, nominations, ratings, and scores. We would like to predict a movie’s popularity (represented by its IMDb rating) from some combination of the remaining variables using multiple linear regression.

We will start with a model where the response variable is the IMDb rating (imdb_rating).

IMDb rating prediction model

Developing a full model as a reference for all subsequent models is a good first step. The full model includes all potential predictor variables. However, we omit some variables that are in the data set only for informational purposes and make no sense to include in a statistical analysis.

Here’s the list of variables we can omit:

  • The information in the director and actor1 through actor5 variables was already used to determine whether the movie features a director, actor, or actress who has won an Oscar.
  • Variables like imdb_url and rt_url obviously cannot be associated with the popularity of a movie and should therefore be omitted.
  • The wording of a movie’s title is by itself meaningless for answering the research question, although title length could be tried as a predictor.
  • We can omit the title_type variable as we only work with Feature Films here.
  • The actual day of the month of the theatrical (thtr_rel_day) or DVD (dvd_rel_day) release is also meaningless for predicting a movie’s popularity. One could imagine some correlation between the release month or year and popularity, but we suspect it is low, if present at all.
  • We should also omit the critics_rating and audience_rating variables, as they are essentially derivatives of critics_score and audience_score respectively. Each pair is obviously collinear, so including both members of a pair would add little value to the model.
  • We suspect collinearity between the best_pic_nom and best_pic_win variables: obviously, a movie cannot win an Oscar without being nominated. We should therefore remove one of them, say best_pic_win.

The question is whether we should use Rotten Tomatoes scores to predict the IMDb rating and vice versa. In general, a popular movie rates high on both platforms (though this is not always the case). This means the IMDb rating should correlate positively with the RT audience score despite the two being calculated differently. We can check whether this holds by computing a correlation coefficient:
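In R this is a one-liner (assuming the Feature Film subset is stored in `ff`):

```r
# Pearson correlation between IMDb rating and RT audience score
cor(ff$imdb_rating, ff$audience_score)
```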

## [1] 0.8487537


Indeed, we can see that the correlation coefficient is very high (~0.85) and positive. However, the variance does not appear constant (it is small for popular movies and larger for unpopular ones) for the reason discussed in Part 3. There is definitely an association between these two variables, but it is not linear.

This becomes even more apparent when the RT critics score and critics rating variables are plotted against the IMDb rating (Plots 10-13):

The funnels we see are a perfect illustration of the non-constant variability described earlier. We therefore should not use the RT scores to predict the IMDb rating with multiple linear regression.

The full model should then look like this:

## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_year + 
##     thtr_rel_month + dvd_rel_year + dvd_rel_month + imdb_num_votes + 
##     best_pic_nom + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box, data = na.omit(ff))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8520 -0.4062  0.0699  0.5605  2.0753 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.871e+01  1.701e+01   2.863 0.004358 ** 
## genreAnimation                 -2.379e-01  3.530e-01  -0.674 0.500658    
## genreArt House & International  9.488e-01  2.802e-01   3.386 0.000761 ***
## genreComedy                    -3.070e-02  1.450e-01  -0.212 0.832405    
## genreDocumentary                1.029e+00  5.070e-01   2.030 0.042886 *  
## genreDrama                      6.849e-01  1.247e-01   5.494 6.04e-08 ***
## genreHorror                    -9.927e-02  2.177e-01  -0.456 0.648561    
## genreMusical & Performing Arts  1.186e+00  3.200e-01   3.708 0.000230 ***
## genreMystery & Suspense         4.323e-01  1.625e-01   2.660 0.008038 ** 
## genreOther                      4.802e-01  2.486e-01   1.932 0.053930 .  
## genreScience Fiction & Fantasy -8.811e-02  3.179e-01  -0.277 0.781739    
## runtime                         5.149e-03  2.620e-03   1.965 0.049896 *  
## mpaa_ratingNC-17               -2.281e-01  8.845e-01  -0.258 0.796598    
## mpaa_ratingPG                  -6.047e-01  2.600e-01  -2.325 0.020413 *  
## mpaa_ratingPG-13               -8.087e-01  2.708e-01  -2.986 0.002952 ** 
## mpaa_ratingR                   -5.297e-01  2.628e-01  -2.015 0.044346 *  
## mpaa_ratingUnrated              1.343e-01  3.551e-01   0.378 0.705479    
## thtr_rel_year                  -1.095e-02  4.817e-03  -2.274 0.023364 *  
## thtr_rel_month                  7.542e-03  1.068e-02   0.706 0.480371    
## dvd_rel_year                   -1.062e-02  1.052e-02  -1.009 0.313386    
## dvd_rel_month                   2.041e-02  1.069e-02   1.909 0.056754 .  
## imdb_num_votes                  3.500e-06  3.682e-07   9.504  < 2e-16 ***
## best_pic_nomyes                 3.665e-01  2.040e-01   1.797 0.072954 .  
## best_actor_winyes               1.472e-02  1.040e-01   0.142 0.887487    
## best_actress_winyes            -1.733e-02  1.141e-01  -0.152 0.879318    
## best_dir_winyes                 2.011e-01  1.412e-01   1.424 0.155037    
## top200_boxyes                  -9.112e-02  2.378e-01  -0.383 0.701758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.841 on 546 degrees of freedom
## Multiple R-squared:  0.3896, Adjusted R-squared:  0.3605 
## F-statistic:  13.4 on 26 and 546 DF,  p-value: < 2.2e-16

The full model includes all variables that we believe are meaningful. It has an adjusted R2 of about 0.3605, which means the model explains approximately 36% of the variance in the response variable (imdb_rating in this case). Let’s see if we can find a model with higher predictive power by changing the combination of predictors.

We’ll be using a forward selection technique here. Forward selection starts with an empty model; we then add variables one at a time until no remaining variable improves the model (as measured by adjusted R2).
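A minimal hand-rolled sketch of this procedure (the candidate names match the full model above; `ff_imdb` is the modeling data frame used later in the output):

```r
# Forward selection by adjusted R^2
candidates <- c("genre", "runtime", "mpaa_rating", "thtr_rel_year",
                "thtr_rel_month", "dvd_rel_year", "dvd_rel_month",
                "imdb_num_votes", "best_pic_nom", "best_actor_win",
                "best_actress_win", "best_dir_win", "top200_box")
selected <- character(0)
best <- -Inf
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # Adjusted R^2 of the current model plus each remaining candidate
  scores <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "imdb_rating")
    summary(lm(f, data = ff_imdb))$adj.r.squared
  })
  if (max(scores) <= best) break   # no candidate improves the model
  selected <- c(selected, names(which.max(scores)))
  best <- max(scores)
}
bestformula <- reformulate(selected, response = "imdb_rating")
```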

## 
## Call:
## lm(formula = bestformula, data = ff_imdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8427 -0.4093  0.0707  0.5695  2.0833 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.851e+01  1.688e+01   2.873 0.004222 ** 
## imdb_num_votes                  3.459e-06  3.537e-07   9.779  < 2e-16 ***
## genreAnimation                 -2.238e-01  3.497e-01  -0.640 0.522592    
## genreArt House & International  9.483e-01  2.786e-01   3.404 0.000713 ***
## genreComedy                    -2.578e-02  1.434e-01  -0.180 0.857393    
## genreDocumentary                1.020e+00  5.052e-01   2.019 0.043925 *  
## genreDrama                      6.829e-01  1.228e-01   5.558 4.25e-08 ***
## genreHorror                    -9.543e-02  2.167e-01  -0.440 0.659819    
## genreMusical & Performing Arts  1.193e+00  3.186e-01   3.745 0.000200 ***
## genreMystery & Suspense         4.269e-01  1.594e-01   2.678 0.007637 ** 
## genreOther                      4.683e-01  2.467e-01   1.898 0.058179 .  
## genreScience Fiction & Fantasy -9.631e-02  3.166e-01  -0.304 0.761131    
## thtr_rel_year                  -1.062e-02  4.774e-03  -2.225 0.026497 *  
## mpaa_ratingNC-17               -1.778e-01  8.796e-01  -0.202 0.839905    
## mpaa_ratingPG                  -5.916e-01  2.581e-01  -2.293 0.022245 *  
## mpaa_ratingPG-13               -8.009e-01  2.683e-01  -2.985 0.002958 ** 
## mpaa_ratingR                   -5.131e-01  2.597e-01  -1.976 0.048673 *  
## mpaa_ratingUnrated              1.471e-01  3.526e-01   0.417 0.676637    
## runtime                         5.503e-03  2.472e-03   2.226 0.026401 *  
## best_pic_nomyes                 3.846e-01  1.992e-01   1.931 0.054054 .  
## dvd_rel_month                   1.902e-02  1.048e-02   1.814 0.070152 .  
## best_dir_winyes                 2.028e-01  1.407e-01   1.441 0.150160    
## dvd_rel_year                   -1.084e-02  1.043e-02  -1.039 0.299084    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared:  0.3888, Adjusted R-squared:  0.3644 
## F-statistic: 15.91 on 22 and 550 DF,  p-value: < 2.2e-16

As a result, we arrive at the model imdb_rating ~ imdb_num_votes + genre + thtr_rel_year + mpaa_rating + runtime + best_pic_nom + dvd_rel_month + best_dir_win + dvd_rel_year, with an adjusted R2 of 0.3644, slightly higher than that of the full model (0.3605). However, some of the variables have p-values above the 5% significance level. The order of a variable in the formula reflects its impact on the adjusted R2 (the earlier in the formula, the greater the impact).

Let’s try a backward elimination technique and compare the resulting model with the one above. Backward elimination starts with the model that includes all potential predictor variables. Variables are then eliminated one at a time until removing any further variable no longer improves the adjusted R2; at each step we eliminate the variable whose removal leads to the largest improvement in adjusted R2.
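The mirror image of the forward search, again as a sketch against the same candidate set and the `ff_imdb` data frame:

```r
# Backward elimination by adjusted R^2
preds <- c("genre", "runtime", "mpaa_rating", "thtr_rel_year",
           "thtr_rel_month", "dvd_rel_year", "dvd_rel_month",
           "imdb_num_votes", "best_pic_nom", "best_actor_win",
           "best_actress_win", "best_dir_win", "top200_box")
adj_r2 <- function(p) summary(lm(reformulate(p, response = "imdb_rating"),
                                 data = ff_imdb))$adj.r.squared
best <- adj_r2(preds)
repeat {
  if (length(preds) == 1) break
  # Adjusted R^2 of the model with each single variable removed
  scores <- sapply(preds, function(v) adj_r2(setdiff(preds, v)))
  if (max(scores) <= best) break   # no removal improves the model
  preds <- setdiff(preds, names(which.max(scores)))
  best <- max(scores)
}
bwrd_bestformula <- reformulate(preds, response = "imdb_rating")
```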

## 
## Call:
## lm(formula = bwrd_bestformula, data = ff_imdb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8427 -0.4093  0.0707  0.5695  2.0833 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.851e+01  1.688e+01   2.873 0.004222 ** 
## genreAnimation                 -2.238e-01  3.497e-01  -0.640 0.522592    
## genreArt House & International  9.483e-01  2.786e-01   3.404 0.000713 ***
## genreComedy                    -2.578e-02  1.434e-01  -0.180 0.857393    
## genreDocumentary                1.020e+00  5.052e-01   2.019 0.043925 *  
## genreDrama                      6.829e-01  1.228e-01   5.558 4.25e-08 ***
## genreHorror                    -9.543e-02  2.167e-01  -0.440 0.659819    
## genreMusical & Performing Arts  1.193e+00  3.186e-01   3.745 0.000200 ***
## genreMystery & Suspense         4.269e-01  1.594e-01   2.678 0.007637 ** 
## genreOther                      4.683e-01  2.467e-01   1.898 0.058179 .  
## genreScience Fiction & Fantasy -9.631e-02  3.166e-01  -0.304 0.761131    
## runtime                         5.503e-03  2.472e-03   2.226 0.026401 *  
## mpaa_ratingNC-17               -1.778e-01  8.796e-01  -0.202 0.839905    
## mpaa_ratingPG                  -5.916e-01  2.581e-01  -2.293 0.022245 *  
## mpaa_ratingPG-13               -8.009e-01  2.683e-01  -2.985 0.002958 ** 
## mpaa_ratingR                   -5.131e-01  2.597e-01  -1.976 0.048673 *  
## mpaa_ratingUnrated              1.471e-01  3.526e-01   0.417 0.676637    
## thtr_rel_year                  -1.062e-02  4.774e-03  -2.225 0.026497 *  
## dvd_rel_year                   -1.084e-02  1.043e-02  -1.039 0.299084    
## dvd_rel_month                   1.902e-02  1.048e-02   1.814 0.070152 .  
## imdb_num_votes                  3.459e-06  3.537e-07   9.779  < 2e-16 ***
## best_pic_nomyes                 3.846e-01  1.992e-01   1.931 0.054054 .  
## best_dir_winyes                 2.028e-01  1.407e-01   1.441 0.150160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8385 on 550 degrees of freedom
## Multiple R-squared:  0.3888, Adjusted R-squared:  0.3644 
## F-statistic: 15.91 on 22 and 550 DF,  p-value: < 2.2e-16

Backward elimination produced the same model. We should now run diagnostics for the predictors included in the model:

genre, runtime, mpaa_rating, thtr_rel_year, dvd_rel_year, dvd_rel_month, imdb_num_votes, best_pic_nom, best_dir_win

Model diagnostics

To assess whether the multiple regression model is reliable, we need to check for:

  1. nearly normal residuals,
  2. constant variability of residuals,
  3. independence of residuals, and
  4. a linear relationship between each variable and the outcome.
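These checks can be sketched with base R graphics (assuming the selected formula and the `ff_imdb` data frame from above; the object names are illustrative):

```r
m <- lm(imdb_rating ~ imdb_num_votes + genre + thtr_rel_year +
          mpaa_rating + runtime + best_pic_nom + dvd_rel_month +
          best_dir_win + dvd_rel_year, data = ff_imdb)

# (1) nearly normal residuals: histogram and normal probability plot
hist(resid(m))
qqnorm(resid(m)); qqline(resid(m))

# (2) constant variability: absolute residuals against fitted values
plot(fitted(m), abs(resid(m)))

# (3) independence: residuals in order of data collection
plot(resid(m))

# (4) linearity: residuals against each numerical predictor, e.g. runtime
plot(ff_imdb$runtime, resid(m))
```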

Let’s take a look at the normal probability plot of the residuals for our model.

We can see some deviations from the normal model, but they are not extreme, and there are no outliers that might be a cause for concern. In a normal probability plot of residuals, outlying residuals are what we worry about most, since they indicate long tails in the distribution of the residuals.

Next, absolute values of residuals against fitted values. This plot helps check the condition that the variance of the residuals is approximately constant.

We don’t see any obvious deviations from constant variance in this plot. Because there are few movies with IMDb ratings around 8, 9, or 10 in the data set, we cannot tell whether the variance remains constant for those values, but there is no evidence to the contrary in the data we have.

Residuals in order of data collection. Such a plot helps identify any connection between cases collected close to one another. If consecutive observations tend to be close to each other, the independence assumption for the observations would fail. Since the data set is a random sample of movies, we don’t expect any problems here.

As expected, here we see no structure that indicates a problem.

The last thing to check is the residuals against each predictor variable. The first row of the graphics below shows the residual plots:

Let’s look at the categorical variables first. For the genre variable, variability clearly differs between genres. The variability among the mpaa_rating groups is somewhat more constant, although Unrated movies (only 16 in the data frame) show noticeably less variability than the others, and NC-17 shows none at all (only one NC-17 movie remains in the modeling data set). Otherwise, this variable looks fine. Both best_pic_nom and best_dir_win show non-constant variability (the latter to a lesser extent). On the one hand, this can be explained by the much smaller number of movies and directors that have been nominated for an Oscar; however, the real reason is not clear from the data.

The numerical variables look more uniform overall. For theatrical release year (thtr_rel_year) and DVD release year (dvd_rel_year) we see no structure (the residuals appear random around 0). There might be some remaining structure in the DVD release month variable, where a slight ‘wave’ goes up and down. There is a clear ‘funnel’ in both the runtime and imdb_num_votes variables. For imdb_num_votes, predictions for movies with fewer votes are less accurate (larger differences between predicted and observed values) than for movies with many votes. This is not unexpected: a point estimate gets closer to the actual population parameter as the sample size increases. It is harder to explain why the model’s accuracy drops for shorter movies, though.

It’s worth mentioning that the following variables have p-values above the 5% significance level: best_dir_win, best_pic_nom, dvd_rel_year, and dvd_rel_month.

Interpretation of model coefficients

The table below shows the point estimates of the coefficients for each predictor in our model.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.5063482 16.8831814 2.8730573 0.0042222
genreAnimation -0.2237513 0.3497397 -0.6397652 0.5225916
genreArt House & International 0.9482872 0.2785873 3.4039143 0.0007127
genreComedy -0.0257799 0.1433986 -0.1797776 0.8573934
genreDocumentary 1.0201916 0.5051834 2.0194481 0.0439252
genreDrama 0.6828532 0.1228487 5.5584900 0.0000000
genreHorror -0.0954325 0.2166944 -0.4404014 0.6598194
genreMusical & Performing Arts 1.1931570 0.3186205 3.7447588 0.0001996
genreMystery & Suspense 0.4268653 0.1594225 2.6775726 0.0076373
genreOther 0.4682855 0.2466858 1.8983072 0.0581788
genreScience Fiction & Fantasy -0.0963062 0.3166432 -0.3041474 0.7611306
runtime 0.0055027 0.0024717 2.2262654 0.0264007
mpaa_ratingNC-17 -0.1777673 0.8795546 -0.2021106 0.8399050
mpaa_ratingPG -0.5916361 0.2580573 -2.2926539 0.0222446
mpaa_ratingPG-13 -0.8008963 0.2682718 -2.9853911 0.0029582
mpaa_ratingR -0.5131497 0.2597135 -1.9758299 0.0486734
mpaa_ratingUnrated 0.1471493 0.3526396 0.4172796 0.6766367
thtr_rel_year -0.0106217 0.0047742 -2.2248402 0.0264969
dvd_rel_year -0.0108412 0.0104305 -1.0393807 0.2990844
dvd_rel_month 0.0190193 0.0104821 1.8144594 0.0701517
imdb_num_votes 0.0000035 0.0000004 9.7788640 0.0000000
best_pic_nomyes 0.3845828 0.1992098 1.9305415 0.0540535
best_dir_winyes 0.2027599 0.1407099 1.4409774 0.1501601

The intercept (48.51) is the predicted imdb_rating when every numerical predictor equals 0 and every categorical predictor is at its reference level; since, for example, a release year of 0 is impossible, the intercept has no meaningful interpretation here. The coefficient of 0.0055 for the runtime variable means that, holding the other variables constant, each additional minute of runtime is associated with an average IMDb rating increase of 0.0055.

Some variables have a larger impact on the IMDb rating. For example, a PG-13 MPAA rating (“Parents strongly cautioned: some material may be inappropriate for children under 13”) reduces the predicted IMDb rating by 0.80 relative to the reference level. A movie in the Musical & Performing Arts category gains 1.19, the largest single coefficient.

The number of votes on IMDb (imdb_num_votes) appears to have the smallest per-unit impact on the rating: each additional vote adds only 0.0000035. A movie would need a million votes for its predicted rating to increase by 3.5. Given that a movie in our data set receives 62,861 votes on average, the average impact on the rating is modest (about 0.22).

As expected, both the DVD release year and month have only a tiny impact on the IMDb rating, if any, given their p-values above the significance level. A Best Picture nomination adds 0.38 to the predicted rating.


Part 5: Prediction

To check our model’s prediction accuracy, we picked a movie that is not in the initial data set and was released in 2016: “La La Land” (IMDb link: www.imdb.com/title/tt3783958/). As of December 17, 2019, it had an IMDb rating of 8.0 with 456,399 votes. The movie was nominated for the Best Picture Oscar, and its director, Damien Chazelle, won the Oscar for Best Directing in 2017. The rest of the parameters have been combined into a data frame lalaland.

We use the predict function with our model and the lalaland data frame as inputs:
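A sketch of that call; the lalaland values below are illustrative, filled in from public information about the movie, and `best_model` stands for the selected model fitted earlier:

```r
best_model <- lm(imdb_rating ~ imdb_num_votes + genre + thtr_rel_year +
                   mpaa_rating + runtime + best_pic_nom + dvd_rel_month +
                   best_dir_win + dvd_rel_year, data = ff_imdb)

# New observation: column names must match the model's predictors
lalaland <- data.frame(genre          = "Musical & Performing Arts",
                       runtime        = 128,
                       mpaa_rating    = "PG-13",
                       thtr_rel_year  = 2016,
                       dvd_rel_year   = 2017,
                       dvd_rel_month  = 4,
                       imdb_num_votes = 456399,
                       best_pic_nom   = "yes",
                       best_dir_win   = "yes")

predict(best_model, newdata = lalaland,
        interval = "prediction", level = 0.95)
```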

##        fit      lwr      upr
## 1 7.345815 5.621457 9.070173

The model predicts an IMDb rating of 7.35 for this movie, while the actual rating is 8.0, so the model underestimated the real rating. The lwr and upr values are the bounds of a 95% prediction interval: we are 95% confident that the actual IMDb rating of this movie is between 5.62 and 9.07. The interval is quite wide, which agrees with the adjusted R2 of 0.3644: the model explains only 36.44% of the variation in rating.

Note that we used a movie released in 2016 even though our model is based on movies released before 2016. This implies extrapolation, which makes the prediction even less reliable.


Part 6: Conclusion

In this research, we used factual data on 600+ movies to build a multiple regression model that predicts the IMDb rating. Adjusted R2 was used as the estimate of explained variance. We tried two stepwise model selection techniques (forward selection and backward elimination), which arrived at the same model, one that explains approximately 36.44% of the variation in movie popularity as measured by IMDb rating.

Answering the research question: we can predict the popularity of a movie without watching it, but only to a limited extent.

Fortunately for us movie lovers, the popularity of a piece of art rests on something that is hard to measure and categorize, and this ‘something’ accounts for a good part of the remaining 63.56%.